title: “52414: Home Exam”
date: “June 28th, 2020”
output: html_document
The exam will be submitted individually by uploading the solved exam Rmd and html files to the course moodle.
Please name your files as 52414-HomeExam_ID.Rmd and 52414-HomeExam_ID.html where ID is replaced by your ID number (do not write your name in the file name or in the exam itself).
The number of points for each sub-question is indicated next to it, with \(105\) points overall. The total grade will be at most \(100\).
Once you click on the moodle link for the home exam, the exam will start and you have three days (72 hours) to complete and submit it.
The exam will be available from June 28th to July 19th. The last submission time is July 19th at 23:59.
You may use all course materials, the web and other written materials and R libraries.
You are NOT allowed to discuss any of the exam questions/materials with other students.
Data Analysis:
Some questions may require data wrangling and manipulation, which you will need to decide on yourself.
In questions involving real data, some data may be missing (‘NA’), wrong (e.g. negative values for positive variables)
or cause numerical problems (e.g. zeros which go into a log-transform). If you encounter such data, use your best judgment (e.g. excluding them, zeroing negative values, using \(log(1+x)\) etc.) and mention what you have done in your analysis.
Whenever possible, use objective and specific terms and quantities learned in class, and avoid subjective and general unquantified statements. For example:
Good: “We see a \(2.5\)-fold increase in the curve from Jan. 1st to March 1st”.
Bad: “The curve goes up at the beginning”.
Good: “The median is \(4.7\). We detected five outliers with distance \(>3\) standard deviations from the median”.
Bad: “The five points on the sides seem far from the middle”.
Presentation of Results:
Write your answers and explanations in the text of the Rmd file (not in the code).
The text of your answers should be next to the relevant code, plots and tables and refer to them, and not at a separate place at the end.
You need to explain every step of your analysis. When in doubt, a more detailed explanation is better
than omitting explanations.
Give informative titles, axis names and names for each curve/bar in your graphs.
In some graphs you may need to change the graph limits. If you do so, please include the outlier points you have removed in a separate table.
Add informative comments explaining your code.
Sometimes tables are the best way to present your results (e.g. when asked for a list of items). Exclude irrelevant rows/columns, and display items’ names clearly in your tables.
Show numbers in plots/tables using standard digits and not scientific display.
That is: 9000000 and not 9e+06.
Round numbers to at most 3 digits after the dot - that is, 9.456 and not 9.45581451044
Required libraries are called in the Rmd file. Install any library missing from your R environment. You are allowed
to add additional libraries if you want.
If you do so, please add them at the start of the Rmd file, right below the existing libraries, and explain what libraries you’ve added, and what is each new library used for.
Monte Carlo method is a useful tool for transforming problems of probabilistic nature into deterministic computations using the law of large numbers.
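As a minimal illustration of the idea (a sketch only, not part of the required solution), the same principle estimates \(\pi\) from the fraction of uniform points in the square \([-1,1]^2\) that land inside the unit circle:

```r
# Minimal Monte Carlo sketch: estimate pi via the area of the unit circle
# inscribed in the square [-1,1]^2.
set.seed(1)
n <- 100000
x <- runif(n, -1, 1)
y <- runif(n, -1, 1)
inside <- x^2 + y^2 <= 1   # indicator of landing inside the circle
pi_hat <- 4 * mean(inside) # square area (4) times the inside fraction
pi_hat                     # close to pi for large n, by the law of large numbers
```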
Assume that the radius \(r\) of the four overlapping white circles in the figure below is the last digit of your ID number (‘sifrat-bikoret’). If this digit is \(0\) set \(r=10\). The radius of the large circle is \(2r\).
orange area. Aside from verbal explanations, you may use pseudo-code and/or mathematical symbols to explain your algorithm. The Monte-Carlo-based algorithm estimates the area using a large sample: by the law of large numbers, the fraction of sampled points that land in a region converges to the ratio between that region’s area and the area of the sampling region.
After drawing the large sample, I used the function from Lecture 10 to keep only points inside the large circle (the boundary of our region): it samples a point uniformly in the bounding square, rejects it if it falls outside the circle, and repeats until \(n\) points inside the circle are collected.
The next step is to treat the four small circles as one region and check, for each sample, whether it falls outside all four of them. Counting these points and dividing by \(n\) gives the estimated probability of landing in the orange area; multiplying this probability by the area of the large circle yields the estimated orange area.
The estimator for the orange area is:
\[\text{OrangeArea} \approx \frac{\#\{\text{points outside the four circles}\}}{n} \times \pi (2r)^2\]
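For reference, a closed-form sanity check (derived from the geometry, not given in the exam): each small circle of radius \(r\) lies entirely inside the large circle of radius \(2r\), opposite small circles only touch at the origin, and each of the four adjacent pairs overlaps in a lens of area \(\frac{\pi r^2}{2} - r^2\). By inclusion-exclusion,

\[
\text{Union of four circles} = 4\pi r^2 - 4\left(\frac{\pi r^2}{2} - r^2\right) = 2\pi r^2 + 4r^2,
\qquad
\text{OrangeArea} = \pi(2r)^2 - \left(2\pi r^2 + 4r^2\right) = 2r^2(\pi - 2).
\]

For \(r=8\) this gives \(128(\pi-2) \approx 146.12\), which the Monte-Carlo estimate should approach.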
# the rejection-sampling function from Lecture 10: keep only samples inside the circle
reject_rcircle <- function(n, r)
{
  x <- rep(0, n) # empty vector of length n
  y <- rep(0, n) # empty vector of length n
  counter <- 1
  while (counter <= n) # we need exactly n points inside the circle (<= avoids an off-by-one that would leave the last slot empty)
  {
    cur_x <- runif(1, -r, r) # uniform on the bounding square
    cur_y <- runif(1, -r, r)
    if(cur_x^2 + cur_y^2 <= r^2) # condition for being inside the circle
    {
      x[counter] <- cur_x # keep the point
      y[counter] <- cur_y
      counter <- counter + 1
    }
  }
  return(data.frame(x, y))
}
xy_df <- reject_rcircle(50000, 16) # 50000 samples inside a circle of radius 2r = 16 (r = 8)
ggplot(data = xy_df, aes(x, y)) + geom_point() #all the points inside the circle
# check whether each point (x,y) falls inside any of the 4 small circles; FALSE means the point is in the orange area:
xy_df$inside <- (xy_df$x+8)^2 + (xy_df$y)^2 <= 8^2 | (xy_df$x-8)^2 + (xy_df$y)^2 <= 8^2 | (xy_df$x)^2 + (xy_df$y+8)^2 <= 8^2 | (xy_df$x)^2 + (xy_df$y-8)^2 <= 8^2
# the estimator for the orange area:
orange_area <- (sum(!xy_df$inside) / 50000) * pi*(2*8)^2
cat("The estimated orange area equals", orange_area)
## The estimated orange area equals 146.1157
# as a sanity check, the known closed-form expression for this area:
real_orange_area <- 2*(8^2)*(pi-2)
cat("The exact orange area is:", real_orange_area, ", and the estimated orange area is:", orange_area)
## The exact orange area is: 146.1239 , and the estimated orange area is: 146.1157
# plot of the simulated points, colored by whether they fall inside the four small circles:
ggplot(data = xy_df, aes(x = x, y = y)) + geom_point(aes(col = inside), show.legend = FALSE) + scale_colour_manual(values = c("orange", "black")) + labs(title = "Simulated points: orange = outside the four circles")
orange_probability <- sum(!xy_df$inside) / 50000 # estimated probability of landing in the orange area
orange_sd <- sqrt(orange_probability*(1 - orange_probability)) # standard deviation of the Bernoulli indicator of the orange area
# normal-approximation confidence interval for the probability (alpha = 0.05):
lower <- orange_probability - qnorm(0.975)*orange_sd/sqrt(50000)
upper <- orange_probability + qnorm(0.975)*orange_sd/sqrt(50000)
# scale to a confidence interval for the *area* by multiplying by the large-circle area:
orange_lower <- lower * pi*(2*8)^2
orange_upper <- upper * pi*(2*8)^2
cat("Confidence interval for the area based on this standard deviation is: [", orange_lower ,",", orange_upper, "]")
## Confidence interval for the area based on this standard deviation is: [ 143.3976 , 148.8338 ]
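As an alternative to the normal-approximation interval above, base R’s `prop.test` gives a Wilson-type confidence interval for the inside-proportion directly, which can then be scaled by the large-circle area. A sketch (the count `k` below is a hypothetical value close to the one observed in this run, for illustration only):

```r
# Alternative CI sketch: k = number of points outside the four circles
# (hypothetical count, for illustration), n = total sampled points.
n <- 50000
k <- 9083
ci <- prop.test(k, n)$conf.int # CI for the proportion of orange-area points
area_ci <- ci * pi * 16^2      # scale to an area CI (large-circle radius 2r = 16)
round(area_ci, 3)
```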
We would like to compare and visualize the trends in terms of numbers of COVID-19 cases and deaths between different countries.
WHO-COVID-19-global-data.csv from the World Health Organization webpage (see the link Download Map Data on the bottom right corner of the map). The data represents the daily number of cases and deaths from COVID-19 in different world countries.
Change the column representing the date to Date. Make sure that this column represents only the date and set it to ‘date’ type. For example, the first element in the ‘Date’ column should be “2020-02-24”.
Show the head and tail of the resulting data-frame.
data1 <- read.csv("C:/Users/Asus/Desktop/R_data/WHO-COVID-19-global-data.csv", header = T)
names(data1)[1] <- "Date" # rename the first column to Date
data1$Date <- as.Date(data1$Date) # convert the column to Date type
# show (print) the head and the tail of this data:
head(data1)
## Date Country_code Country WHO_region New_cases Cumulative_cases
## 1 2020-02-24 AF Afghanistan EMRO 1 1
## 2 2020-02-25 AF Afghanistan EMRO 0 1
## 3 2020-02-26 AF Afghanistan EMRO 0 1
## 4 2020-02-27 AF Afghanistan EMRO 0 1
## 5 2020-02-28 AF Afghanistan EMRO 0 1
## 6 2020-02-29 AF Afghanistan EMRO 0 1
## New_deaths Cumulative_deaths
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
tail(data1)
## Date Country_code Country WHO_region New_cases Cumulative_cases
## 27841 2020-07-07 ZW Zimbabwe AFRO 18 734
## 27842 2020-07-08 ZW Zimbabwe AFRO 53 787
## 27843 2020-07-09 ZW Zimbabwe AFRO 98 885
## 27844 2020-07-10 ZW Zimbabwe AFRO 41 926
## 27845 2020-07-11 ZW Zimbabwe AFRO 16 942
## 27846 2020-07-12 ZW Zimbabwe AFRO 40 982
## New_deaths Cumulative_deaths
## 27841 1 9
## 27842 0 9
## 27843 0 9
## 27844 3 12
## 27845 1 13
## 27846 5 18
Israel and its neighbours. Extract as candidate neighbours all countries with WHO_region = EMRO. Add Israel and other neighbour countries that you notice are missing, and remove far away countries (e.g. Afghanistan, Djibouti). Use your best judgment in selecting which additional countries to remove, and keep the total number of neighbour countries below \(15\).
Replace long country names by meaningful short names for better readability and graph appearance.
For example, if Venezuela (Bolivarian Republic of) was one of our neighbours, we would have replaced it by Venezuela.
Next, plot the cumulative number of cases as a function of the Date for these countries (one plot, a different curve per each country).
Repeat and show in a different plot the cumulative number of deaths for each of these countries.
df_neighb <- data1 %>% filter(WHO_region == "EMRO" | Country == "Israel" | Country == "Turkey" | Country == "Cyprus") # keep the EMRO region and add Israel, Turkey and Cyprus (Turkey and Cyprus are also neighbours of Israel)
list_of_countries <- df_neighb[-which(duplicated(df_neighb$Country)),] # remove duplicates so the list of countries is easy to inspect
# Looking at the map on the WHO site above, Afghanistan, Bahrain, Djibouti, Morocco, Oman, Pakistan, Qatar, Somalia, Tunisia, Yemen, United Arab Emirates, Iran and Kuwait are not neighbours of Israel.
# remove all the countries that are not neighbours of Israel:
df_neighb <- df_neighb %>% subset(Country != "Afghanistan" & Country != "Bahrain" & Country != "Djibouti" & Country != "Morocco" & Country != "Oman" & Country != "Pakistan" & Country != "Qatar" & Country != "Somalia" & Country != "Tunisia" & Country != "Yemen" & Country != "United Arab Emirates" & Country != "Iran (Islamic Republic of)" & Country != "Kuwait")
cat("I chose", length(unique(df_neighb$Country)), "countries, which is below 15") # verify that the count is below 15
## I chose 12 countries, which is below 15
df_neighb$Country <- as.character(df_neighb$Country) # convert the column to character
# shorten long country names:
df_neighb$Country <- ifelse(df_neighb$Country == "occupied Palestinian territory, including east Jerusalem", "Palestine", df_neighb$Country)
df_neighb$Country <- ifelse(df_neighb$Country == "Syrian Arab Republic", "Syria", df_neighb$Country)
# plots of cumulative cases and deaths:
ggplot(data = df_neighb, aes(x = Date, y = Cumulative_cases, group = Country)) + geom_line(show.legend = T, aes(color = Country), size=0.8, alpha=0.9) + ylab("Cumulative cases") + xlab("Dates") + labs(title = "Cumulative number of cases as a function of the Date")
ggplot(data = df_neighb, aes(x = Date, y = Cumulative_deaths, group = Country )) + geom_line(show.legend = T, aes(color = Country), size=0.8, alpha=0.9) + ylab("Cumulative deaths") + xlab("Dates") + labs(title = "Cumulative number of deaths as a function of the Date")
lab1 from here. Merge the two data-frames such that the new data-frame will keep the information in the COVID-19 data-frame, yet will also contain for each row the total population of each country in \(2018\).
Manually rename country names that do not match between the two datasets - you don’t have to change all names, but focus on including countries that come up in the analysis of (b.) and of the next sub-questions.
Create four new columns, respectively representing the number of cumulative cases and deaths per one million people, and the number of new daily cases and deaths per one million people.
For the same countries used in (b.), plot in two separate figures
the log-scaled cumulative number of cases and deaths per million, as a function of the Date.
Which countries seem to be doing the worst based on these plots? how is Israel coping compared to its neighbours?
df_eco <- read.csv(url("https://raw.githubusercontent.com/DataScienceHU/DataAnalysisR_2020/master/data/economic_data.csv")) #read the data from the url
df_eco_pop <- df_eco %>% filter(Series.Code == "SP.POP.TOTL") # keep only the total-population rows (SP.POP.TOTL is their series code)
names(df_eco_pop)[1] <- "Country" # rename the first column
df_eco_pop$Country <- as.character(df_eco_pop$Country)
# countries whose names differ between the two datasets (not all are strictly necessary, but I compared the differences):
countries_to_fix <- c("Bahamas", "Brunei", "Congo (Brazzaville)", "Congo (Kinshasa)", "Czechia", "Egypt","Gambia", "Iran", "Korea, South","Kyrgyzstan","Laos", "Russia","Slovakia", "Saint Kitts and Nevis", "Saint Lucia","Saint Vincent and the Grenadines", "Syria","The United Kingdom", "United States of America", "Venezuela", "Palestine", "Yemen")
# the row indices of those countries in df_eco_pop:
indices_to_fix <- c(14, 29, 45, 46, 53, 59, 72, 92, 105,108,109,162,174,182,183,185,190, 206, 207, 211,214,215)
# changing the names:
for(i in 1:length(countries_to_fix)){
df_eco_pop$Country[indices_to_fix[i]] <- countries_to_fix[i]
}
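Hard-coded row indices break if the CSV row order ever changes. A more robust, order-independent alternative is a named lookup vector; a small self-contained sketch (the long names on the left are assumptions about how raw names might look, not the actual entries):

```r
# Order-independent renaming sketch: map long names to short ones via a
# named lookup vector; countries without a mapping are left unchanged.
name_map <- c("Russian Federation"   = "Russia",
              "Syrian Arab Republic" = "Syria")
countries <- c("Israel", "Russian Federation", "Syrian Arab Republic")
short <- ifelse(countries %in% names(name_map),
                name_map[countries], countries)
short # "Israel" "Russia" "Syria"
```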
data1$Country <- as.character(data1$Country) # convert to character
# rename countries in data1 to match the economic data:
data1$Country <- ifelse(data1$Country == "Iran (Islamic Republic of)", "Iran", data1$Country)
data1$Country <- ifelse(data1$Country == "occupied Palestinian territory, including east Jerusalem", "Palestine", data1$Country)
data1$Country <- ifelse(data1$Country == "Syrian Arab Republic", "Syria", data1$Country) # "Syria" matches the name used in the economic data above
# merging the data1 and eco_pop:
merge_covid <- merge(data1, df_eco_pop, by = "Country")
merge_covid <- merge_covid[-c(9,10,11,13,14)] # remove the unnecessary columns
names(merge_covid)[9] <- "Population_Total" # rename the population column
merge_covid$Population_Total = as.numeric(as.character(merge_covid$Population_Total)) # convert the population column from factor to numeric (non-numeric entries become NA)
## Warning: NAs introduced by coercion
# the columns per one million people:
merge_covid$"Cumulative cases per 1M" <- 1000000 * (merge_covid$Cumulative_cases)/ merge_covid$Population_Total
merge_covid$"Cumulative deaths per 1M" <- 1000000 * (merge_covid$Cumulative_deaths) / merge_covid$Population_Total
merge_covid$"New cases per 1M" <- 1000000 * (merge_covid$New_cases)/ merge_covid$Population_Total
merge_covid$"New deaths per 1M" <- 1000000 * (merge_covid$New_deaths)/merge_covid$Population_Total
# keep only the relevant columns (to prevent duplicates):
merge_covid_relev <- merge_covid[, c(1,2,9,10,11,12,13)]
# add the relevant columns to the neighbours data:
df_neighb <- inner_join(x = df_neighb, y= merge_covid_relev, by = c("Country", "Date"))
# log-scaled cumulative number of cases per million, as a function of the Date
# (referring to the columns by their bare backticked names inside aes avoids the "df$... is discouraged" warning):
ggplot(data = df_neighb, aes(x = Date, y = log(1 + `Cumulative cases per 1M`))) + geom_line(show.legend = T, aes(color = Country), size=0.8, alpha=0.9) + ylab("Log of cumulative cases per 1M") + xlab("Date") + labs(title = "Log of cumulative cases per million as a function of the Date")
# log-scaled cumulative number of deaths per million, as a function of the Date:
ggplot(data = df_neighb, aes(x = Date, y = log(1 + `Cumulative deaths per 1M`))) + geom_line(show.legend = T, aes(color = Country), size=0.8, alpha=0.9) + ylab("Log of cumulative deaths per 1M") + xlab("Date") + labs(title = "Log of cumulative deaths per million as a function of the Date")
Based on these plots, the countries doing worst appear to be Iraq, Saudi Arabia and Sudan, whose curves are still increasing roughly exponentially; they also show the highest death values. Saudi Arabia has the highest cases-per-million value, with Israel next. Israel is in fourth place in deaths per million, which is not good; however, during April-May Israel managed to flatten its curve, and over the last month we see a renewed increase that we hope will not continue.
the population age-structure, the fact that testing and diagnosing cases are different between countries, and that the pandemic is still ongoing and more deaths are expected in the future for current cases).
Calculate for each country the total cases per million and total deaths per million (It is recommended to create a new data-frame with one row per country), and make a scatter-plot comparing the two shown on a log-scale.
Fit a linear regression line to the log-scaled data.
Define the fatality rate as the ratio of the total deaths to the total cases. Display the distribution of fatality rate across countries using a plotting method of your choice. Describe the distribution: what is the mean/median? is it symmetric? skewed? are there outlier countries? which ones?
# aggregate the sum of new cases per 1M (because the last date in the data is not the same for all countries):
total_cases_1M <- aggregate(`New cases per 1M` ~ Country, data = merge_covid , FUN = sum)
total_deaths_1M <- aggregate(`New deaths per 1M` ~ Country,data = merge_covid, FUN = sum)
total_both_1M <- inner_join(x = total_cases_1M, y= total_deaths_1M, by = "Country")
# change the column names:
names(total_both_1M) <- c("Country", "Total cases per 1M", "Total deaths per 1M")
# scatter-plot comparing log total cases per 1M vs. log total deaths per 1M:
ggplot(total_both_1M, aes(x=log(1+`Total cases per 1M`), y=log(1+`Total deaths per 1M`))) + geom_point()+ geom_smooth(method = 'lm') + xlab("Log of total cases per 1M") + ylab("Log of total deaths per 1M") + labs(title = "Total cases per 1M vs. total deaths per 1M (log scale)")
## `geom_smooth()` using formula 'y ~ x'
# aggregate the sums of new cases and new deaths (again, because the last date in the data is not the same for all countries):
total_cases <- aggregate(New_cases ~ Country, data = merge_covid , FUN = sum)
total_deaths <- aggregate(New_deaths ~ Country, data = merge_covid , FUN = sum)
total_both <- inner_join(x = total_cases, y= total_deaths, by = "Country")
# change the column names:
names(total_both) <- c("Country", "Total cases", "Total deaths")
# add a fatality-rate column:
total_both$Fatality_rate <- total_both$`Total deaths` / total_both$`Total cases`
# summary statistics for analyzing the distribution:
mean_fatality <- round(mean(total_both$Fatality_rate),3 )
med_fatality <- round(median(total_both$Fatality_rate),3 )
skewness_fatality <- round(skewness(total_both$Fatality_rate),3 )
kurtosis_fatality <- round(kurtosis(total_both$Fatality_rate),3 )
# density plot of the distribution, with the mean (red) and median (blue) marked:
ggplot(total_both, aes(x = Fatality_rate, fill = "Fatality_rate")) + geom_density(show.legend = FALSE) + scale_fill_brewer(palette = "Dark2") + xlab("Fatality Rate") + geom_vline(xintercept = mean(total_both$Fatality_rate), col="Red") + geom_vline(xintercept = median(total_both$Fatality_rate), col="Blue") + scale_x_continuous(breaks=seq(0, 0.3, 0.05)) +
ggtitle(label = "Fatality Rate Distribution", subtitle = paste("Red line = mean at", mean_fatality, ", Blue line = median at", med_fatality))
# boxplot to check for outliers:
ggplot(total_both) + geom_boxplot(aes(y = Fatality_rate), outlier.color = "red") +labs(title= "Box plot of Fatality Rate")
# the outlier countries all have fatality rate above 0.1, so I collect them in one data frame:
outliers_countries <- total_both %>% filter(Fatality_rate > 0.1) %>% arrange(desc(Fatality_rate))
outliers_countries
## Country Total cases Total deaths Fatality_rate
## 1 Yemen 1389 365 0.2627790
## 2 France 161275 29907 0.1854410
## 3 Belgium 62606 9782 0.1562470
## 4 The United Kingdom 288957 44798 0.1550334
## 5 Italy 242827 34945 0.1439090
## 6 Hungary 4229 595 0.1406952
## 7 British Virgin Islands 8 1 0.1250000
## 8 Netherlands 50866 6128 0.1204734
## 9 Mexico 295268 34730 0.1176220
## 10 Spain 253908 28403 0.1118634
cat("The mean of Fatality Rate is:", mean_fatality, ", median:", med_fatality, ",The Skewness:", skewness_fatality, ",And the kurtosis is:", kurtosis_fatality)
## The mean of Fatality Rate is: 0.032 , median: 0.021 ,The Skewness: 2.708 ,And the kurtosis is: 10.302
From the plots and the summary statistics (mean, median, skewness, kurtosis) we see that the fatality-rate distribution (which is non-negative by definition) is far from normal, with a long right tail (skewness \(2.708\), kurtosis \(10.302\)). The small mean and median indicate that in most countries deaths are few relative to cases, a relatively good situation. There are \(10\) outlier countries (detailed in the table above) in which deaths are high relative to cases compared with the other countries.
total deaths. For these countries, plot the smoothed number of new daily cases.
You can use the geom_smooth function.
Identify at least three different qualitative behaviors of the curves of the different countries. Which countries seem to have overcome the pandemic?
In which countries is it still rising? Are there countries with a second wave?
Do you see different patterns between different countries?
# the countries with more than 10000 total deaths:
worst_covid <- total_both %>% filter(total_both$`Total deaths` > 10000)
# add the other columns from the merge covid:
worst_covid <- inner_join(x = worst_covid, y = merge_covid, by = "Country")
# shorten the long names of the worst-hit countries for the plot:
worst_covid$Country <- ifelse(worst_covid$Country == "The United Kingdom", "UK", worst_covid$Country)
worst_covid$Country <- ifelse(worst_covid$Country == "United States of America", "US", worst_covid$Country)
# smooth plot of the countries worst hit corona:
ggplot(data = worst_covid) + geom_smooth(aes(x = Date, y = New_cases, color = Country)) + ylab("New daily cases") + labs(title = "Plot of Smoothed number of new daily cases")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Looking at the smoothed plot, three clear qualitative behaviors emerge: a sharp increase, a plateau followed by a renewed increase (a second wave), and a decrease (recovery).
The US shows the sharp-increase pattern: cases rose steeply from March to mid-April, then the increase calmed down, and in June a second wave began with a very sharp rise.
Brazil has been increasing from May until now without a break, and India has likewise been increasing since May.
Peru and the UK show a decrease, which indicates recovery.
Iran has stayed at a moderate level, as have Spain and Italy since May.
So the countries that seem to have overcome the pandemic are Peru and the UK, because of their decreasing curves.
The patterns clearly differ: most of the European countries are in a moderate state, while the US, India and Brazil keep increasing.
countries that are at the peak of the disease, and countries that have passed the peak but still see many cases, or are facing a ‘second wave’.
Create a new column for the merged data-frame containing a smoothed version of the daily number of new cases per million, where smoothing is performed using moving-average with a moving-average “window” of width two-weeks (14 days). You can use for example the runmean command from the caTools package.
For each country compute the maximum number of smoothed new daily cases, representing the average number of new daily cases at the peak of the epidemic.
Similarly, compute for each country the last number of smoothed new daily cases, representing the current average number of daily cases. Make sure that the last number is indeed an average of the last \(14\) days and not fewer days (e.g. a single day).
We define a country as recovered if we have \(\frac{last}{maximum} < 0.01\), i.e. the current number of daily cases is less than \(1\%\) than the number at the peak of the epidemic.
Similarly, we define a country as exponentially increasing if we have \(\frac{last}{maximum} > 0.99\),
i.e. it seems that the country still hasn’t reached its peak.
We define a country as struggling if we have \(0.25 < \frac{last}{maximum} < 0.75\), i.e. it seems that the country has passed its peak, but the epidemic is still active.
Determine the countries belonging to each of the three groups: recovered, exponentially increasing and struggling
(many countries do not belong to any of these three groups).
Next, for each of the three groups, make a figure showing all countries in this group in separate plots.
For each country, plot the smoothed number of new daily cases, divided by its max, as a function of the Date. Do the curves representing the different countries look similar within a group? do they look similar for countries from different groups?
For which group can you spot an indication for the start of a second wave of the pandemic? for which countries?
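Before the full analysis, a quick base-R sanity check of what a trailing 14-day moving average does (a sketch only; `caTools::runmean` is the package route suggested in the question). With `sides = 1`, `stats::filter` uses only the current day and the 13 days before it, so the last value really averages the last \(14\) observations:

```r
# Trailing 14-day moving average sketch with base R (no extra packages):
x <- 1:20 # toy daily counts
ma14 <- as.numeric(stats::filter(x, rep(1/14, 14), sides = 1))
# the first 13 values are NA (not enough history); the last value is
# the mean of days 7..20:
tail(ma14, 1) # 13.5
```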
# order the data by Date and then by Country:
merge_covid <- merge_covid[order(merge_covid$Date),]
merge_covid <- merge_covid[order(merge_covid$Country),]
# the merge data with the unique country (1 from each):
unique_country <- unique(merge_covid$Country)
unique_country <- as.character(unique_country)
# reset the vectors:
daily_cases <- c()
maxi <- c()
last <- c()
# loop over the countries: filter each country, smooth its daily cases per 1M with a 14-day moving average (runmean, as recommended in the question), and record the maximum and the last smoothed value.
for(i in 1:length(unique_country)){
daily <- merge_covid %>% filter(Country == unique_country[i])
aver_daily <- runmean(daily$`New cases per 1M`, 14, align = "right") # trailing window, so the last value averages the full last 14 days and not fewer
maxi <- c(maxi, max(aver_daily))
last <- c(last, aver_daily[length(aver_daily)])
daily_cases <- c(daily_cases, aver_daily)
}
#add column to the merge_covid of the daily mean cases:
merge_covid$mean_daily <- daily_cases
#make a data frame of the 3 relevant vectors:
peak_covid <- data.frame(unique_country, maxi, last)
#change the name of the first column:
names(peak_covid)[1] <- "Country"
# initialize the group column:
peak_covid$covid_group <- 0
peak_covid <- peak_covid %>% filter(Country != "Eritrea") # remove Eritrea, whose values are NA
# classify each country into one of the three groups, according to the thresholds in the question:
for(i in 1:length(peak_covid$last)){
if(peak_covid$last[i]/peak_covid$maxi[i] < 0.01){
peak_covid$covid_group[i] <- "recovered"
} else if(peak_covid$last[i]/peak_covid$maxi[i] > 0.99){
peak_covid$covid_group[i] <- "exponentially increasing"
} else if(peak_covid$last[i]/peak_covid$maxi[i] > 0.25 & peak_covid$last[i]/peak_covid$maxi[i] < 0.75){
peak_covid$covid_group[i] <- "struggling"
}
}
# 3 dataframes of the 3 groups separately:
struggling_group <- peak_covid%>% filter(covid_group == "struggling")
increasing_group <- peak_covid%>% filter(covid_group == "exponentially increasing")
recovered_group <- peak_covid%>% filter(covid_group == "recovered")
# add the smoothed daily column (each country's daily average divided by that country's maximum);
# the assignment must be restricted to the country's own rows, otherwise every iteration
# would overwrite the whole column with a division by the current country's maximum:
merge_covid$smoothed_daily <- 0
for(i in 1:length(unique_country)){
idx <- merge_covid$Country == unique_country[i]
merge_covid$smoothed_daily[idx] <- merge_covid$mean_daily[idx] / maxi[i]
}
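For reference, the same per-country normalization can be written without an explicit loop using base R’s `ave`, which computes a group-wise summary and recycles it per row; a self-contained sketch with a toy data frame:

```r
# Loop-free per-country normalization sketch (toy data, base R only):
# ave() computes max(mean_daily) within each Country and recycles it per row.
toy <- data.frame(Country = c("A", "A", "B", "B"),
                  mean_daily = c(2, 4, 10, 5))
toy$smoothed_daily <- toy$mean_daily / ave(toy$mean_daily, toy$Country, FUN = max)
toy$smoothed_daily # 0.5 1.0 1.0 0.5
```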
struggling_group <- struggling_group[,-c(2,3)] # remove the unnecessary columns
increasing_group <- increasing_group[,-c(2,3)] # remove the unnecessary columns
recovered_group <- recovered_group[,-c(2,3)] # remove the unnecessary columns
merge_covid_struggling <- merge(merge_covid, struggling_group, by = "Country") # keep only this group's countries from merge_covid
merge_covid_increasing <- merge(merge_covid, increasing_group, by = "Country") # keep only this group's countries from merge_covid
merge_covid_recovered <- merge(merge_covid, recovered_group, by = "Country") # keep only this group's countries from merge_covid
#plots of the different groups by date and daily smoothed:
ggplot(data = merge_covid_increasing, aes(x = Date, y = smoothed_daily)) + geom_point(aes(color = Country) ) +labs(title = "Increasing Group")
ggplot(data = merge_covid_struggling, aes(x = Date, y = smoothed_daily)) + geom_point(aes(color = Country) ) +labs(title = "Struggling Group")
ggplot(data = merge_covid_recovered, aes(x = Date, y = smoothed_daily)) + geom_point(aes(color = Country) ) +labs(title = "Recovered Group")
Within each group the curves look similar: the countries in a group share the same wave shape, while curves from different groups look clearly different. This suggests the classification separates the three groups correctly.
A Linear Congruential Generator (LCG) is an algorithm that yields a sequence of pseudo-randomized numbers calculated with a modular linear equation. The method represents one of the oldest and best-known pseudo-random number generator algorithms. The theory behind them is relatively easy to understand, and they are easy to implement and fast, especially on computer hardware which can provide modular arithmetic by storage-bit truncation.
The generator is defined by the recurrence relation:
\(X_{n + 1} = (a X_n + c) \: mod \: (m)\)
where \(X_1,X_2,...\) is the sequence of pseudo-random values in the range \([0, m)\). All values \(a,c,m,X_i\) are nonnegative integers and:
\(m\) is the “modulus”, \(m > 0\).
\(a\) is the “multiplier”, \(0 < a < m\).
\(c\) is the “increment”, \(0 \leq c < m\).
\(X_0\) is the “seed” or “start value”, \(0 \leq X_0 < m\).
To produce pseudo-random numbers in the range \([0,1]\) we can simply divide and output the numbers \(\frac{X_i}{m-1}\).
The following visual example shows generating sequences of random numbers using the LCG for different parameters and seeds.
Write your own LCG function that implements an LCG-based pseudo-random number generator.
The function should accept the parameters \(m\), \(a\), \(c\), the current state \(X_0\) of the LCG and the number of random numbers \(n\) to generate.
The function should advance the state \(n\) times and return a vector of \(n\) pseudo-random numbers in the range \([0,1]\) based on the states, and also return the final state of the LCG \(X_n\) (i.e. the new seed).
# LCG function: advances the state n times and returns the whole sequence of
# states (the seed x0 in position 1, so the last element is the new seed X_n).
# Dividing the states by (m-1) maps them to [0,1]; this is done where the
# numbers are used below.
LCG_fun <- function(n, m, a, c, x0) {
lcg_vec <- numeric(n + 1) # preallocate: seed plus n generated states
lcg_vec[1] <- x0
for (i in 1:n) {
lcg_vec[i + 1] <- (a * lcg_vec[i] + c) %% m # the LCG recurrence
}
return(lcg_vec)
}
since the 1960’s. In this IBM.LCG the modulus is \(m=2^{31}\), the multiplier
is \(a=2^{16}+3\), and the increment is \(c=0\).
Set the seed to your ID number (‘teudat zehut’), generate \(2000\) consecutive pseudo-random numbers from the IBM.LCG and
divide them to \(1000\) consecutive pairs denoted \((x_i,y_i), i=1,...,1000\).
Create a scatter plot of the pairs. Does the spread of the points in the \([0,1]^2\) square seem to match i.i.d. uniform samples from this square?
# defining the parameters given in the question:
m <- 2^31
a <- 2^16+3
c <- 0
n <- 2000 #1000 of x and 1000 of y
x0 <- 206096588 #id number
id_fun <- LCG_fun(n, m, a, c, x0)
consecutive_pairs <- data.frame(rep(0, 1000), rep(0, 1000))
names(consecutive_pairs) <- c("x", "y")
sequence <- seq(1, 1000)
consecutive_pairs$x <- id_fun[sequence*2] # the odd-numbered states X1, X3, ...
consecutive_pairs$y <- id_fun[sequence*2+1] # the even-numbered states X2, X4, ... (skipping the seed at index 1, which is not pseudo-random)
ggplot(data = consecutive_pairs) + geom_point(aes(x=x, y=y), color = "red") + labs(title = "Scatter plot of the consecutive pairs")
The spread of the points in the \([0,1]^2\) square matches i.i.d. uniform samples well: the points form a roughly homogeneous cloud with no visible stripes, clusters or lattice structure, and the variance looks constant across the square (no heteroscedasticity). In two dimensions, then, this generation method appears to work.
in a sample of \(n\) points, vs. the observed number \(o_{ij}\) in your simulated data from (b.). Compute the goodness-of-fit test statistic for the data:
$$
S = \sum_{i,j=1}^{10} \frac{(o_{ij}-e_{ij})^2}{e_{ij}}.
$$
For the null hypothesis \(H_0: (x_i, y_i) \sim [0,1]^2 \: i.i.d.\) we know that (approximately) \(S \sim \chi^2(B-1)\). Compute a p-value for this hypothesis testing problem and the statistic \(S\) you have computed. Would you reject the null hypothesis of a uniform distribution?
norm_pairs <- consecutive_pairs/(m-1) # normalize the pairs to [0,1] by dividing by m-1
points <- 1000 #number of (x,y) points
B <- 10^2 #number of bins, from the question
eij <- points/B #expected number of points in each bin: 1000/100 = 10
bins <- matrix(nrow = 10, ncol = 10, data = 0) #the 10x10 grid of bins
S <- 0
#count the observed number of points Oij in each bin
for(i in 1:10){ #the bin rows
for (j in 1:10){ #the bin columns
bins[i,j] <- sum(((i-1)/10 <= norm_pairs$x) & (norm_pairs$x < i/10) &
((j-1)/10 <= norm_pairs$y) & (norm_pairs$y < j/10))
#accumulate S according to the formula
S <- S + (bins[i,j]-eij)^2/eij
}
}
cat("The statistic S equals:", S)
## The statistic S equals: 83.4
# Decide whether to reject H0 at the default level alpha = 0.05:
# check whether S exceeds the 0.95 quantile of the chi-square distribution with B-1 = 99 degrees of freedom
cat("Is S larger than the chi-square critical value?", S > qchisq(0.95, df=99))
## Is S larger than the chi-square critical value? FALSE
# calculate the p-value:
p <- 1-pchisq(q = S, df=99)
cat("The p-value equals", round(p,3), "which is larger than alpha = 0.05, so we do not reject the H0 hypothesis of uniformity")
## The p-value equals 0.87 which is larger than alpha = 0.05, so we do not reject the H0 hypothesis of uniformity
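As a cross-check (my own addition, not required by the question), base R's chisq.test applied to the 100 binned counts with equal cell probabilities reproduces the same statistic and p-value, provided all 1000 points fell inside the bins:

```r
# counts: the 100 observed bin counts o_ij from the 10x10 grid computed above
counts <- as.vector(bins)
# H0: each of the 100 bins has equal probability 1/100; df = 99
res <- chisq.test(counts, p = rep(1/100, 100))
res$statistic  # the same chi-square statistic S
res$p.value    # the same p-value as 1 - pchisq(S, df = 99)
```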
runif) and compute the statistic \(S\) from above for each sample, generating random statistics \(S_1,..,S_{n_{iters}}\). Plot the empirical density distribution of the \(S_i\) test statistics and compare it to the theoretical density of the \(\chi^2(B-1)\) distribution. Are the distributions similar?
Compute the empirical p-value of the test statistic \(S\) from (c.), defined as \(1-\hat{F}_{n_{iters}}(S)\)
where \(\hat{F}_{n_{iters}}\) is the empirical CDF of \(S_1,..,S_{n_{iters}}\). Does it match the theoretical p-value from (c.)?
n_iters <- 10000
statistic <- rep(0, n_iters)
S_obs <- S #save the observed statistic from (c.) before S is reused in the loop
eij <- 10
# sample n_iters datasets of 1000 points from the uniform distribution on [0,1]^2
# and compute the statistic S for each:
for (k in 1:n_iters){
x <- runif(1000)
y <- runif(1000)
bins <- matrix(nrow = 10, ncol = 10, data = 0)
S <- 0
for(i in 1:10){
for (j in 1:10){
bins[i,j] <- sum(((i-1)/10 <= x) & (x < i/10) & ((j-1)/10 <= y) & (y < j/10))
S <- S + (bins[i,j]-eij)^2/eij #the formula of S
}
}
statistic[k] <- S
}
#plot the empirical density distribution of the simulated statistics vs. the theoretical chi-square density
plot(density(statistic), col="orange", lwd=2, xlab = "Statistic", main="Empirical density of S vs. theoretical density of chi-square(99)")
curve(dchisq(x,df=99), add=TRUE, col="green", lwd=2)
The empirical density of the \(S_i\) closely follows the theoretical \(\chi^2(99)\) curve, so the two distributions are indeed similar.
#Compute the empirical p-value of the test statistic from (c.):
F_statistic <- ecdf(statistic) #the empirical CDF, using the built-in ecdf function
empirical_p_val <- 1 - F_statistic(S_obs) #evaluate at the observed statistic from (c.), not at the last simulated S
cat("Theoretical p-value from (c.):", round(p,3), " Empirical p-value:", round(empirical_p_val,3))
Since the empirical CDF is evaluated at the observed statistic \(S=83.4\) from (c.), the empirical p-value comes out close to the theoretical \(0.87\), as expected given that \(S\) approximately follows \(\chi^2(99)\) under \(H_0\).
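Beyond the visual density comparison, one could (my own addition) quantify the similarity of the simulated statistics to \(\chi^2(99)\) with a Kolmogorov-Smirnov test against the named CDF:

```r
# KS test of the simulated statistics against the chi-square(99) CDF;
# a small D statistic / large p-value means no detectable deviation.
# 'statistic' is assumed to hold the n_iters simulated S values from above.
ks.test(statistic, "pchisq", df = 99)
```

Note that the \(S_i\) only take values in multiples of \(1/10\), so ks.test will warn about ties; the result is indicative rather than exact, and the \(\chi^2\) law itself is only an asymptotic approximation.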
consecutive triplets denoted \((x_i,y_i,z_i), i=1,...,1000\). Create a 3-dimensional scatter plot of the simulated data using the package plotly and the command plot_ly.
Does the spread of the points in the \([0,1]^3\) cube seem to match i.i.d. uniform samples from this cube?
hint: you can rotate the 3-dim. plot in different directions. Look at the plot from the z-axis looking down on the x-axis and y-axis. What is your conclusion?
Note: the 3-dim. plot generated by plotly may not be observable when viewing the knitted html file from within \(R\). Use a web browser like Chrome, Firefox etc. to view the html file and the plot.
m <- 2^31
a <- 2^16+3 #the IBM.LCG multiplier defined in the question
c <- 0
n <- 3000 #3 dimensions, 1000 triplets
x0 <- 206096588
id_fun <- LCG_fun(n, m, a, c, x0)
#build the 1000 consecutive triplets
sequence <- seq(1, n/3)
consecutive_triplets <- data.frame(x = id_fun[sequence*3],
y = id_fun[sequence*3-2],
z = id_fun[sequence*3-1])
consecutive_triplets %>% plot_ly(x=~x, y=~y, z=~z) # 3d scatter plot
## No trace type specified:
## Based on info supplied, a 'scatter3d' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter3d
## No scatter3d mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
Looking down the z-axis onto the x-y plane the points appear uniform, but rotating the cube to the right viewing angle reveals that the triplets concentrate on a small number of parallel planes instead of filling the cube. Hence the spread does not match i.i.d. uniform samples from \([0,1]^3\); this lattice structure is the well-known weakness of the IBM.LCG.
IBM.LCG.
# Using alternative parameters listed in Wikipedia (https://en.wikipedia.org/wiki/Linear_congruential_generator - the first option in the parameters table):
m <- 2^32
a <- 1664525
c<- 1013904223
n <- 3000
x0 <- 206096588
random_id <- LCG_fun(n, m, a, c, x0)
sq <- seq(1, n/3)
consecutive_triplets_1 <- data.frame(x = random_id[sq*3],
y = random_id[sq*3-2],
z = random_id[sq*3-1])
consecutive_triplets_1 %>% plot_ly(x=~x, y=~y, z=~z) # 3d scatter plot
With these parameters the triplets fill the \([0,1]^3\) cube with no visible planar or lattice structure from any viewing angle, so the alternative LCG indeed solves the problems encountered in the IBM.LCG.